Retinol binding protein can be constructed from a small 
number of large substructures taken from three unrelated 
proteins. The known structures are treated as a knowledge 
base from which one extracts information to be used in 
molecular modelling when lacking true atomic resolution. This 
includes the interpretation of electron density maps and 
modelling homologous proteins. Models can be built into maps 
more accurately and more quickly. This requires the use of 
a skeleton representation for the electron density which 
improves the determination of the initial chain tracing. 
Fragment-matching can be used to bridge gaps for inserted 
residues when modelling homologous proteins. 
Introduction 

Single-crystal X-ray diffraction is still the most powerful method 
for obtaining the three-dimensional structure of a protein 
molecule. Although many technical improvements have been 
made since the structure of myoglobin was determined by 
Kendrew et al. (1960), the interpretation of the experimentally 
determined electron density map remains a difficult point in the 
process. For many reasons these maps are rarely of sufficient 
quality to show individual atoms and will often contain breaks 
in main chain density. 

Map interpretation has been made easier by the fact that 
proteins contain significant amounts of regular structure such as 
α-helices and β-strands, predicted by Pauling and co-workers 
(Pauling and Corey, 1951; Pauling et al., 1951), and standard turns 
(Venkatachalam, 1968; see Richardson, 1981, for an extensive 
review). A model with the correct conformation can then be made 
and fitted to even a relatively poor piece of density. Such a model 
may be made of wire (Richards, 1968) or generated in a 
computer graphics system (Jones, 1982). 

This building-block approach to protein modelling can be 
expanded to include all fragments making up the molecule. We 
show that a protein can be constructed from large fragments of 
just a few proteins. Such substructures can also be easily 
matched to a suitable representation of the electron density, e.g. to the 
skeletonized density of Greer (1974). Because many errors and 
ambiguities exist in such a skeleton, extensive adjustments may 
be required and we have therefore implemented these methods 
in the interactive graphics program, FRODO (Jones, 1978, 1982, 
1985). 

Although we emphasize the use of fragment-fitting in protein 
crystallography, the technique is useful in a number of 
modelling activities involving non-atomic resolution data. This includes 
modelling homologous proteins where it can suggest a number 
of possible conformations, and in n.m.r. spectroscopy where 
fragments can be located to satisfy local interatomic distance 
measurements (Kraulis and Jones, in preparation). 

Results and Discussion 

Searching for fragments of similar structure 
The main domain of retinol binding protein (RBP) consists of 
an unusual eight-stranded up-and-down β-barrel that encapsulates 
the retinol molecule (Newcomer et al., 1984). There are seven 
reverse turns between these β-strands, two of which, residues 
48-52 and 124-128, have a similar but unusual main chain 
hydrogen bonding scheme. 

In this scheme carbonyl oxygen O(i) forms a hydrogen bond 
with peptide nitrogen N(i+3), and N(i) forms one with O(i+4). In both 
turns residue i+3 is a glycine, and the main chain torsion angles 
correspond to a type I turn (Venkatachalam, 1968) followed by 
a bulge (Richardson et al., 1978). We found that the relevant 
Cα atoms of 48-52 can be matched to 124-128 with a root 
mean square (r.m.s.) deviation of only 0.23 Å (Figure 1). This 
turn conformation had not been identified as a standard 
substructure in the review by Richardson (1981) but its frequency in RBP 
suggested that it may be a common template in anti-parallel 
strands. To test this hypothesis, we searched the entire protein 
data bank (Bernstein et al., 1977) and found this substructure 



Table I. Results of building RBP from three other proteins 

RBP residues   Matching protein   Protein residue number   R.m.s. deviation (Å)

4-11           HCAC               30                       0.64
12-17          ADH                77                       1.02
18-21          HCAC               130                      0.04
22-33          HCAC               204                      1.28
34-39          ADH                9                        0.39
40-47          STNV               159                      0.67
48-52          ADH                123                      0.29
53-62          STNV               132                      0.56
63-67          ADH                308                      0.34
68-78          STNV               69                       1.04
79-86          HCAC               94                       1.02
87-92          ADH                312                      0.24
93-97          ADH                110                      0.26
98-106         ADH                22                       0.91
107-114        STNV               122                      1.28
114-121        STNV               61                       0.54
122-128        ADH                121                      0.66
129-139        STNV               56                       1.03
140-144        STNV               28                       0.31
145-161        ADH                323                      0.79
161-167        HCAC               133                      0.80
168-173        HCAC               56                       0.53

The matching protein residue number is the internal residue count of the 
first residue in the matching zone. As such, it is not necessarily the same 
as the residue name. The r.m.s. deviation is the result of a least squares fit. 

Fig. 1. Matching the reverse turn near RBP residue 50. The backbone 
atoms of residues 48-52 are shown in an atom colouring convention 
(carbons are yellow, nitrogens are blue and oxygens are red). The green 
fragment is RBP residues 124-128 and matches with a deviation of 
0.23 Å. 




Fig. 2. Searching for a match to all of strand A in RBP, including the 
bulge at residue 27. The green fragment is from carbonic anhydrase C. 

in 23 different proteins using an r.m.s. cut-off of 0.5 Å for 
matching all main chain atoms. The same substructure has been 
recently identified by Sibanda and Thornton (1985) from an 
extensive study of turns between anti-parallel strands. 

The ease with which we found an unknown conformation 
prompted the question of whether, indeed, any part of RBP was 
unique. The matching fragments shown in Table I were 
obtained by trial and error at the display from the refined coordinates 
of only three proteins: satellite tobacco necrosis virus, STNV 
(Jones and Liljas, 1984b), apo-alcohol dehydrogenase, ADH 
(Jones and Eklund, in preparation) and human carbonic anhydrase 
C, HCAC (T.A.Jones, E.Eriksson and A.Liljas, in preparation). 
If a fragment matched, the region was extended and the match 
repeated. A large substructure is shown in Figure 2. After 
rebuilding the whole molecule, it was regularized to remove the 
discontinuities that had been introduced between the fragments. 
The main chain r.m.s. deviation to the starting model was 1.0 Å. 
Considering we used fragments from only three proteins and that 
we located and combined them in a very simple way, this is a 
surprising result both in terms of the goodness-of-fit and in the 
number of fragments used. 



Fig. 3. Portion of the 3.1 Å electron density map of RBP with a calculated 
skeleton. The skeleton atoms have been automatically classified as main 
chain (lilac) and side chain (green). 

A more elegant method of finding the best set of fragments 
has been suggested (M.Levitt, private communication) that uses 
a dynamic programming algorithm similar to that employed in 
sequence comparisons (Needleman and Wunsch, 1970). This 
finds the minimum number of fragments required to build the 
structure where each fragment matches the structure to within 
a pre-set limit. With the criterion that Cα atoms are matched to within 
1 Å r.m.s., this method builds RBP from 15 fragments, and 
with a 0.5 Å cut-off it requires 20 fragments. 
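
The dynamic programming idea can be sketched as follows. This is only an illustration under stated assumptions, not the algorithm suggested by Levitt: the structure is split into consecutive, non-overlapping segments, and `fits(start, end)` is a hypothetical predicate that returns True when some database fragment matches residues start..end within the pre-set r.m.s. limit. The names `min_fragment_cover`, `fits` and `max_len` are invented for this example.

```python
from typing import Callable, List, Optional, Tuple

def min_fragment_cover(n_residues: int,
                       fits: Callable[[int, int], bool],
                       max_len: int = 20) -> Optional[List[Tuple[int, int]]]:
    """Split residues 0..n_residues-1 into the fewest consecutive segments
    such that fits(start, end) holds for every segment (end inclusive)."""
    inf = float("inf")
    best = [inf] * (n_residues + 1)   # best[k]: fewest segments covering 0..k-1
    back = [-1] * (n_residues + 1)    # back[k]: start of the last segment used
    best[0] = 0
    for end in range(1, n_residues + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] + 1 < best[end] and fits(start, end - 1):
                best[end] = best[start] + 1
                back[end] = start
    if best[n_residues] == inf:
        return None                   # no cover exists within the limit
    segments = []
    k = n_residues
    while k > 0:                      # trace the chosen segments backwards
        segments.append((back[k], k - 1))
        k = back[k]
    return segments[::-1]

# Toy example: the set of matchable segments is made up for illustration.
allowed = {(0, 2), (3, 5), (6, 9), (0, 5), (0, 4), (5, 9)}
cover = min_fragment_cover(10, lambda s, e: (s, e) in allowed)
```

Tightening the limit removes entries from the set of matchable segments, so the minimum number of fragments can only stay the same or increase, which is the trend reported above (15 fragments at 1 Å, 20 at 0.5 Å).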
Protein crystallographic applications 

The construction of an initial model from an electron density map 
is frequently a difficult task for even highly trained scientists. 
The process is usually complicated by lack of resolution in the 
X-ray data, lack of isomorphism in the heavy atom derivatives 
and sometimes by lack of an amino acid sequence. The 
crystallographer is faced with long range problems such as getting the 
correct chain tracing, and local problems such as the correct 
orientation of peptide planes. Jones (1982) showed that even 
simple peptide plane errors could not be automatically removed by 
refinement programs. If the rest of the model is sufficiently 
accurate, maps calculated with phases obtained from the model 
coordinates can be inspected to locate and correct errors (Jones, 
1982). More usually in crystallographic refinement, the model 
(and hence the phases) gradually improves but requires many 
cycles of model refitting and refinement (Remington et al., 1982). 
There are even reports that removing incorrect parts of the 
structure from the phases can still leave a ghost of the incorrect 
structure in the map (Finzel et al., 1984). It is therefore of great benefit 
to start with as accurate a structure as possible. 

Various methods have been used to build models into maps 
with computer graphics. Our experience with the program 
FRODO (Jones, 1978, 1982, 1985) suggests that it is first 
necessary to determine the protein fold from contoured mini-maps 
drawn on plastic sheets. If secondary structure elements are 
recognized they can be constructed and fitted as rigid groups to 
a set of rough atomic guide points. Any gaps can be filled in 
based on a few guide points per residue. This produces a rough 
starting model whose main chain follows the trace determined 
on the mini-map. It then requires refitting to closely match the 
density, possibly using automated techniques (Jones and Liljas, 
1984a). 

A different approach has been suggested by Greer (1974) that 
attempts to automate model building. The first step in this 
process is to reduce the density map to a set of connected 
points that follow the density. This skeletal representation makes 
it much easier to recognize branch points that may represent side 
chains or, at higher resolution, carbonyl groups. The method was 
applied to 2 Å and 3 Å maps of known structures 
(Greer, 1974, 1976) and could produce provisional main chain 
coordinates. The many errors usually present in skeleton 
connectivity probably explain why the method has not been widely 
used. 

Fig. 4. The skeleton of Figure 3 has been re-defined to suggest a different 
chain tracing hypothesis. The orange line represents the new main 
chain. A number of bonds have been made and broken to satisfy this 
hypothesis. 

A different skeletonization algorithm using critical point net- 
works (Johnson, 1978) has been combined with computer 
graphics by Pique (1984) and co-workers. 

A portion of the multiple isomorphous replacement map 
of RBP is shown in Figure 3. This shows the usual electron 
density contours with a Greer skeleton calculated from the map. The 
skeleton colouring shows a calculated main chain/side chain 
assignment made according to the lengths of connected pieces. 
This region corresponds to the tripeptide 133-135 (Tyr-...) 
and was the starting point of our initial map 
interpretation. It illustrates the problems associated with low 
resolution maps: the main chain skeleton shows a break due to a local 
weakness in the density, and a pair of hydrogen bonding side 
chains (the serine and a tyrosine that is not drawn) form a 
continuous density that is assigned main chain status. Figure 4 shows 
the same region after interactively locating and correcting these 
errors and re-defining the main chain skeleton as an acceptable 
trace. Figures 3 and 4 show the important use of colour to 
illustrate the current skeleton assignments. 

Fragments from known structures can be matched to the 
skeleton. This requires positioning putative Cα atoms along the 
skeleton, either automatically (with restraints that 
place them ~3.8 Å apart) or by explicitly 
defining a skeleton point to be a Cα atom. To ensure a good and quick 
match one usually fits a length of skeleton corresponding to a 
fragment of 5-7 residues. Adjacent fragments are often 
overlapped by one residue because the carbonyl group of the last residue 
in a fragment plays no role in the matching. 
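
As a small illustration of the overlap scheme, the sketch below lists the residue ranges that would be fitted when a stretch of chain is covered by fixed-length fragments sharing one residue with their neighbours. The function name and its defaults (five residues, one residue overlap) are assumptions chosen for this example; any residues left over at the end of the stretch are simply not covered.

```python
def overlapping_windows(first, last, length=5, overlap=1):
    """Residue ranges (inclusive) of consecutive fragments that share
    `overlap` residues with the preceding fragment."""
    if not 0 <= overlap < length:
        raise ValueError("overlap must be smaller than the fragment length")
    step = length - overlap
    windows = []
    start = first
    while start + length - 1 <= last:
        windows.append((start, start + length - 1))
        start += step
    return windows

# Residues 116-136 are covered exactly by five such fragments.
windows = overlapping_windows(116, 136)
```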

Fragment-fitting the skeleton in the region 
corresponding to RBP residues 116-136 (Figure 5) gives a model with 
an r.m.s. deviation to the final refined coordinates of 0.95 Å 
for all main chain atoms. Our original MIR model was the result 
of many hours of careful fitting, but has a significantly higher 
r.m.s. deviation of 1.30 Å to the final coordinates. 

Fig. 5. The main chain skeleton (in orange) has been fragment-fitted (in 
green) five residues at a time and with one residue overlap. The final 

the fragment search. The coloured traces are the 20 best fits to the 
remaining four residues. The correct RBP chain is a member of the 
dominant cluster of 14 traces. 

The skeleton has the added advantage of giving an overall view 
of the density for which one previously relied on mini-maps. 
However, it is a great improvement since it can be easily 
changed, saved, restored and viewed from any direction. 
Model building homologous structures 
The number of known protein sequences far exceeds the number 
of known structures. Thus, for each newly determined structure 
there is usually at least one other protein with some sequence 
homology. For example, the Escherichia coli DNA polymerase 
I Klenow fragment (Ollis et al., 1985a) could immediately be 
used to model T7 DNA polymerase (Ollis et al., 1985b). Model 
building homologous proteins relies on defining structurally 
conserved and variable regions (Greer, 1981). Amino acid 
mutations in conserved regions are easily carried out with programs 
such as FRODO. Insertions and deletions in the variable regions 
are much more difficult to model. In these regions we are 
frequently faced with the problem: how does one go between two 
points in space using a certain number of amino acids? 

T.A.Jones and S.Thirup 

Substructure matching is able to provide some answers to this 
question. By way of illustration, we shall again refer to loop 
47-53 in RBP. When a search is made for this loop among the 
coordinates of 37 highly refined proteins, all of the 20 best 
matches have the RBP conformation. Excluding Gly 51 from the 
search also gives 20 essentially identical traces. Excluding 
residues 49-51 gives the set of Cα traces shown in Figure 6. 
Fourteen of the 20 traces are similar to the loop observed in RBP 
and of these, seven had a glycine equivalent to Gly 51. A 
second cluster of three conformations is also apparent. We are 
currently investigating various length loops and deletions to more 
accurately determine the probability of identifying the correct 
substructure. 

Conclusions 

Our initial experiments suggest that proteins can be constructed 
from large building blocks whose exact size and number remain 
to be determined. 

We have extracted from the protein data bank the best refined 
sets of coordinates to use as a knowledge base for structure 
analysis. A fast search and matching algorithm allows one to 
interactively model substructures from this database under 
conditions made difficult by a lack of high resolution data. 

Our computer graphics implementation of density 
skeletonization gives an improved overview of a possible chain trace. It also 
contains sufficient detail to build a model with fragment-fitting 
which is at least as good as can be obtained by careful manual 
fitting. The speed with which we can build a model from a 
skeleton makes it much easier to test chain tracing hypotheses. 

Materials and methods 

Diagonal plot algorithm for locating similar conformations 
Efficient techniques have been developed to find the best least squares fit of one 
set of points to another set (Kabsch, 1978; McLachlan, 1979). The goodness-of-fit 
can then be judged by the r.m.s. deviation between one set of points and the 
correctly transformed second set. Alternative methods can be formulated; in 
particular we have used the interatomic diagonal plot (Phillips, 1970). This consists 
of a matrix of distances where element (i, j) is the distance between points i and 
j. When these points represent protein Cα atoms, the plot can be used to recognize 
domains and structural motifs (Rossmann and Liljas, 1974). 
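
The distance matrix underlying a diagonal plot can be computed as in the sketch below. It is a minimal Python illustration, not the Fortran program described later in this section; the function name and the example coordinates are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]

def distance_matrix(ca: Sequence[Point]) -> List[List[float]]:
    """Element (i, j) is the distance between Calpha atoms i and j."""
    n = len(ca)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # The matrix is symmetric, so each pair is computed once.
            d[i][j] = d[j][i] = math.dist(ca[i], ca[j])
    return d

# Hypothetical coordinates: three atoms on a line, 3.8 A apart.
ca_atoms = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
plot = distance_matrix(ca_atoms)
```

Because the matrix depends only on internal distances, it is unchanged by any rotation or translation of the coordinates, which is what allows two fragments to be compared without first superimposing them.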

If two fragments have the same structure, they will also have the same set of 
inter-Cα distances. However, the reverse is not true. Our distance matching 
algorithm is 35 times faster than a least squares algorithm when comparing five 
points. It is therefore used as a sieve to locate similarities which are then tested 
with the least squares algorithm. The goodness-of-fit of each fragment is judged 
by the sum: 

where d_ij is an inter-Cα distance in one structure and d'_ij is the equivalent distance 
in the second structure, and the sum is taken over the relevant distances. 
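
A minimal sketch of such a distance-based sieve is given below. It assumes, purely for illustration, that the score is the sum of absolute differences between equivalent inter-Cα distances; the function names, the cut-off and the coordinates are likewise hypothetical. Unlike the program described here, the sketch recomputes the distances for every window instead of using pre-calculated values.

```python
import math
from itertools import combinations

def inter_ca_distances(ca):
    """All pairwise distances of a fragment, in a fixed (i < j) order."""
    return [math.dist(a, b) for a, b in combinations(ca, 2)]

def distance_score(d_query, d_target):
    """ASSUMED score: sum of absolute differences of equivalent distances."""
    return sum(abs(p - q) for p, q in zip(d_query, d_target))

def sieve(query_ca, chain_ca, cutoff):
    """Slide the query along a chain; keep windows scoring at or below cutoff."""
    n = len(query_ca)
    d_query = inter_ca_distances(query_ca)
    hits = []
    for start in range(len(chain_ca) - n + 1):
        window = chain_ca[start:start + n]
        score = distance_score(d_query, inter_ca_distances(window))
        if score <= cutoff:
            hits.append((start, score))
    return hits

# Hypothetical data: a query, a translated copy and a mirror image of it.
query = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0), (4.0, 5.0, 3.5)]
copy_ = [(x + 10.0, y + 10.0, z + 10.0) for x, y, z in query]
mirror = [(x + 50.0, y, -z) for x, y, z in query]
hits = sieve(query, copy_ + mirror, cutoff=1e-6)
```

In this example both the translated copy and the mirror image pass the sieve, since a reflection preserves every distance. This is one way in which equal distances fail to imply equal structure, and it is why the surviving candidates are then tested with the least squares algorithm.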

The protein Cα distances are pre-calculated by a Fortran program that accepts 
all commonly used coordinate files. A fragment of five residues can be searched 
in a library of 34 proteins (containing 5271 residues) in ~3 s on a Vax 750 
computer. 

Electron density skeletonization 

This is a two stage procedure. The first step creates a set of linked points from 
an electron density map using essentially Greer's algorithm (Greer, 1974). This 
first removes all points below a pre-set value. Multiple passes are then made 
through the map with an increasing threshold. A point will not be removed if 
a hole is created, or if it is a tip or single point. All points with a value equal 
to the current threshold will then be removed unless they are needed to preserve 
continuity. This algorithm results in a connected trace of points which is 
sensitive to the starting base value. We find that contoured electron density is best 
viewed at one standard deviation, while the skeleton is best calculated with a 
base level and increment of ~1.3 and 1.0 SD, respectively. 

In the second stage each 'atom' in the skeleton is given a status defining it 
as part of the main chain or of a side chain. This is done according to the length 
of the linked list containing the atom. The program also provides extra 
unconnected atoms that may be used later. 
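
The length-based status assignment might look like the sketch below. The threshold of ten points, the status labels and the data layout (each linked list given as a list of atom identifiers) are assumptions made for this illustration; the text does not state the values used.

```python
def classify_skeleton(linked_lists, min_main_chain_points=10):
    """Give every skeleton atom a status from the length of its linked list.

    linked_lists: list of lists of atom identifiers, one list per connected
    piece of skeleton. The length threshold is a hypothetical value.
    """
    status = {}
    for piece in linked_lists:
        label = "main" if len(piece) >= min_main_chain_points else "side"
        for atom in piece:
            status[atom] = label
    return status

# Toy skeleton: one long piece and one short branch.
pieces = [list(range(0, 25)), [100, 101, 102]]
status = classify_skeleton(pieces)
```

A purely length-based rule misassigns long connected side chain density, as in the region of Figure 3, which is why the assignment can be corrected interactively.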

Both programs are written in Fortran, and skeletonize a map of 56 x 49 x 77 
points in 14 min on a Vax 750. 

Graphics interface 

We have implemented our FRODO enhancements on a coloured line drawing 
Evans and Sutherland PS330. The calculated skeleton can be changed by 
moving its atoms, by re-defining connectivity, and by re-assigning the atomic status. 
A third status is available which we normally use to define our currently accepted 
main chain trace. Colour is vital to show the current skeleton assignments (Figures 
3 and 4). 

Two fragment matching options are available. One places Cα atoms along a 
linked list of skeleton or protein atoms such that each is positioned ~3.8 Å from 
its neighbour, unless forced to accept particular points as Cα atoms. In the second 
option, one explicitly defines which Cα atoms in a protein fragment are to be 
used to make a match. 
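
The first option can be illustrated with the sketch below, which walks along a linked list of points treated as a polyline and places each new Cα at the first position further along the path that lies a fixed straight-line distance from the previous one. It is a simplified sketch rather than the FRODO implementation: forced Cα positions are not handled, and the function name and example path are assumptions.

```python
import math

def place_ca_atoms(path, spacing=3.8):
    """Place points along a polyline so that consecutive placed points are
    `spacing` apart in straight-line distance, starting at path[0]."""
    placed = [tuple(path[0])]
    seg, t_min = 0, 0.0          # current segment and position along it
    while seg < len(path) - 1:
        a, b, c = path[seg], path[seg + 1], placed[-1]
        ab = [b[k] - a[k] for k in range(3)]
        ac = [a[k] - c[k] for k in range(3)]
        # Solve |a + t*ab - c|^2 = spacing^2 for t (quadratic in t).
        qa = sum(x * x for x in ab)
        qb = 2.0 * sum(ab[k] * ac[k] for k in range(3))
        qc = sum(x * x for x in ac) - spacing * spacing
        disc = qb * qb - 4.0 * qa * qc
        t = None
        if qa > 0.0 and disc >= 0.0:
            # The larger root is where the path leaves the sphere around c.
            t = (-qb + math.sqrt(disc)) / (2.0 * qa)
        if t is not None and t_min < t <= 1.0:
            placed.append(tuple(a[k] + t * ab[k] for k in range(3)))
            t_min = t
        else:
            seg, t_min = seg + 1, 0.0   # no crossing here; try next segment
    return placed

# Hypothetical skeleton path with two right-angle bends.
bent = place_ca_atoms([(0, 0, 0), (3, 0, 0), (3, 3, 0), (6, 3, 0)])
straight = place_ca_atoms([(0, 0, 0), (10, 0, 0)])
```

Using the straight-line distance rather than the path length keeps neighbouring Cα atoms the required distance apart even where the skeleton bends sharply.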

The Cα traces of the 20 best matches can be viewed (Figure 6) and each can 
be seen in turn as a stripped poly-alanine chain (Figure 1). The coordinates of 
any of these fits can be incorporated into the FRODO atomic data set. The fit 
of each residue can then be further improved either manually (Jones, 1978), 
automatically by real space methods (Jones and Liljas, 1984a) or with a new 
option that matches all of the residue, including the side chain, to the skeleton. 
Any combination of accepted main chain, side chain or automatically assigned 
main chain atoms can be viewed. This gives one the flexibility to view details 
or to get a large volume overview. 